feat: use nh3 for HTML sanitization#10264
feat: use nh3 for HTML sanitization#10264nouralmaa wants to merge 30 commits intoietf-tools:mainfrom
Conversation
ietf/utils/html.py
Outdated
| # Allow the protocols/tags/attributes we specifically want, plus anything that nh3 declares | ||
| # to be safe. | ||
|
|
||
| acceptable_protocols = nh3.ALLOWED_URL_SCHEMES.union( |
There was a problem hiding this comment.
TODO check. this is more permissive than bleach's strict superset. however, unsure if this adds extra vulnerabilities since nh3 uses a better maintained parser https://github.com/rust-ammonia/ammonia
There was a problem hiding this comment.
@jennifer-richards @rjsparks any thoughts?
There was a problem hiding this comment.
I'm going to try to get a few other eyes on this
There was a problem hiding this comment.
I think adding tel: and ftp: (really?) are fine from a security perspective.
There was a problem hiding this comment.
We don't really need ftp, but tel might be useful.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #10264 +/- ##
=======================================
Coverage 88.36% 88.36%
=======================================
Files 325 325
Lines 43653 43652 -1
=======================================
Hits 38573 38573
+ Misses 5080 5079 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ietf/utils/html.py
Outdated
| "li", "ol", "p", "pre", "q", "s", "samp", "small", "span", "strike", "style", | ||
| "li", "ol", "p", "pre", "q", "s", "samp", "small", "span", "strike", | ||
| "strong", "sub", "sup", "table", "title", "tbody", "td", "tfoot", "th", "thead", | ||
| "tr", "tt", "u", "ul", "var" |
There was a problem hiding this comment.
xmp is also good
| "tr", "tt", "u", "ul", "var" | |
| "tr", "tt", "u", "ul", "var", "xmp" |
ietf/utils/html.py
Outdated
| # to be safe. | ||
|
|
||
| acceptable_protocols = nh3.ALLOWED_URL_SCHEMES.union( | ||
| {"http", "https", "mailto", "ftp", "xmpp"} |
There was a problem hiding this comment.
| {"http", "https", "mailto", "ftp", "xmpp"} | |
| {"ftp", "http", "https", "mailto", "tel", "xmpp"} |
Sort. Add "tel".
(If performance is dictated by order, move "https" to the top.)
ietf/utils/html.py
Outdated
| protocols=acceptable_protocols, | ||
| strip=True, | ||
| url_schemes=acceptable_protocols, | ||
| link_rel=None |
There was a problem hiding this comment.
This doesn't seem right to me. The default value is safer. Are there cases where links opened will need window.opener? I can't imagine that being necessary for user-generated content.
Same below.
There was a problem hiding this comment.
cool, changed it here as there were no rel attributes before
ietf/utils/html.py
Outdated
| protocols=acceptable_protocols, | ||
| strip=True, | ||
| _liberal_nh3_cleaner = nh3.Cleaner( | ||
| tags=acceptable_tags.union({"mg", "figure", "figcaption"}), |
There was a problem hiding this comment.
typo?
| tags=acceptable_tags.union({"mg", "figure", "figcaption"}), | |
| tags=acceptable_tags.union({"img", "figure", "figcaption"}), |
ietf/utils/html.py
Outdated
| """Returns the given HTML sanitized, and with the given tags removed.""" | ||
| allowed = acceptable_tags - set(t.lower() for t in tags) | ||
| return bleach.clean(html, tags=allowed, strip=True) | ||
| return nh3.clean(html, tags=allowed) |
There was a problem hiding this comment.
You might want to audit invocations of this function.
There was a problem hiding this comment.
Not sure if @martinthomson did so himself, but it looks to me like the only use is via the removetags filter in htmlfilters.py, which is not used anywhere (I also checked the DBTemplates)
If so, we can just lose this method entirely.
If not, I think this changes remove_... to escape_... because nh3 doesn't have a strip option (but that's just based on reading docs)
There was a problem hiding this comment.
Doesn't seem to raise any validation errors
821b2a9 to
bf72278
Compare
|
Hi @nouralmaa - Thanks for continuing to work on this. One point on something that's not obvious - commits bf72278 and 9736b74 touch an old data migration (see the filename) - those will never run again, and its better to not touch these migration files as they should show what happened when they actually ran on the production database. If it's easy, please revert those commits. If it's not, we'll do it when this gets towards final review. When we have major changes to django (we may have one coming with django5) we tend to squash the migrations and that one would go away at that time. |
ietf/utils/templatetags/tests.py
Outdated
| '<a href="https://www.ietf.org" rel="nofollow">https://www.ietf.org</a>', | ||
| ) | ||
| self.assertEqual( | ||
| linkify("https://mailman3.ietf.org/mailman3/lists/[email protected]/"), |
| value = conditional_escape(value) | ||
| text = mark_safe(_linkify(value)) # _linkify is a safe operation | ||
| value = urlize(value, autoescape=True) # _linkify is a safe operation | ||
| text = mark_safe(value) |
There was a problem hiding this comment.
This won't work as written - it needs to call urlize() in all cases, not only when autoescape is true. However,
- The commit log around
ietf.doc.templatetags.ietf_filters.urlizementions issues with Django'surlizeand its handling of adjacent parentheses. Need to investigate that and whether it's been fixed. - If it is ok to switch to
urlize, we should just do away withlinkifyas it's redundant.
There was a problem hiding this comment.
have adjusted so this can be called in both cases.
- re the handling of adjacent parentheses and this commit, I can understand why django's urlizer might not be ideal for linkification like
[REF](http://example.com/foo), and the issue still stands. however, I think the markdown filter insideietf.utils.templatetags.htmlfilterscan be used in such cases with links intact - yes, but it would require changing lots of internal html templates and tests to do away with the linkify filter completely. maybe can be raised in a separate issue? can also try to have a look soon.
|
|
||
| def test_view_status_update(self): | ||
| chair = RoleFactory(name_id='chair',group__type_id='wg') | ||
| event = GroupEventFactory(type='status_update',group=chair.group) |
There was a problem hiding this comment.
Unexpected change. Unless there's a reason not to test this behavior, need to fix the assertion below rather than drop the test.
ietf/utils/html.py
Outdated
| """Returns the given HTML sanitized, and with the given tags removed.""" | ||
| allowed = acceptable_tags - set(t.lower() for t in tags) | ||
| return bleach.clean(html, tags=allowed, strip=True) | ||
| return nh3.clean(html, tags=allowed) |
There was a problem hiding this comment.
Not sure if @martinthomson did so himself, but it looks to me like the only use is via the removetags filter in htmlfilters.py, which is not used anywhere (I also checked the DBTemplates)
If so, we can just lose this method entirely.
If not, I think this changes remove_... to escape_... because nh3 doesn't have a strip option (but that's just based on reading docs)
re these commits, I've reverted the changes and also made a change to use html instead of having the links hardcoded into the slugs for that template. I imagine that may be necessary down the road in any case, but feel free to revert if not. |

fixes #10138